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Abstract 

Many interesting machine learning problems are best 
posed by considering instances that are distributions, 
or sample sets drawn from distributions. Previous work 
devoted to machine learning tasks with distributional 
inputs has done so through pairwise kernel evalua¬ 
tions between pdfs (or sample sets). While such an ap¬ 
proach is fine for smaller datasets, the computation of 
N X N Gram matrix is prohibitive in large datasets. 
Recent scalable estimators that work over pdfs have 
done so only with kernels that use Euclidean metrics, 
like the L2 distance. However, there are a myriad of 
other useful metrics available, such as total variation, 
Hellinger distance, and the Jensen-Shannon divergence. 
This work develops the first random features for pdfs 
whose dot product approximates kernels using these 
non-Euclidean metrics, allowing estimators using such 
kernels to scale to large datasets by working in a pri¬ 
mal space, without computing large Gram matrices. We 
provide an analysis of the approximation error in using 
our proposed random features and show empirically the 
quality of our approximation both in estimating a Gram 
matrix and in solving learning tasks in real-world and 
synthetic data. 


Introduction 

As machine learning matures, focus has shifted towards 
datasets with richer, more complex instances. For example, 
a great deal of effort has been devoted to learning func¬ 
tions on vectors of a large fixed dimension. While com¬ 
plex static vector instances are useful in a myriad of ap¬ 
plications, many machine learning problems are more natu¬ 
rally posed by considering instances that are distributions, or 
sets drawn from distributions. Political scientists can learn a 
function from community demographics to vote percentages 
to understand who supports a candidate (Flaxman, Wang, 
and Smola 2015). The mass of dark matter halos can be in¬ 
ferred from the velocity of galaxies in a cluster (Ntampaka 
et al. 2015). Expensive expectation propagation messages 
can be sped up by learning a “just-in-time” regression model 
(Jitkrittum et al. 2015). All of these applications are aided by 
working directly over sets drawn from the distribution of in¬ 
terest, rather than having to develop a per-problem ad-hoc 
set of summary statistics. 

‘These two authors contributed equally. 


Distributions are inherently infinite-dimensional objects, 
since in general they require an infinite number of parame¬ 
ters for their exact representation. Hence, it is not immedi¬ 
ate how to extend traditional finite vector technique machine 
learning techniques to distributional instances. However, re¬ 
cent work has provided various approaches for dealing with 
distributional data in a nonparametric fashion. For example, 
regression from distributional covariates to real or distribu¬ 
tional responses is possible via kernel smoothing (Poczos et 
al. 2012a; Oliva, Poczos, and Schneider 2013), and many 
learning tasks can be solved with RKHS approaches (Muan- 
det et al. 2012; Poczos et al. 2012b). A major shortcoming 
of both approaches is that they require computing N kernel 
evaluations per prediction, where N is the number of train¬ 
ing instances in a dataset. Often, this implies that one must 
compute a N X N Gram matrix of pairwise kernel evalu¬ 
ations. Such approaches fail to scale to datasets where the 
number of instances N is very large. Another shortcoming 
of these approaches is that they are often based on Euclidean 
metrics, either working over a linear kernel, or one based on 
the L 2 distance over distributions. While such kernels are 
useful in certain applications, better performance can some¬ 
times be obtained by considering non-Euclidean based ker¬ 
nels. To this end, Poczos et al. (2012b) use a kernel based 
on Renyi divergences; however, this kernel is not positive 
semi-definite (PSD), leading to even higher computational 
cost and other practical issues. 

This work addresses these major shortcomings by devel¬ 
oping an embedding of random features for distributions. 
The dot product of the random features for two distribu¬ 
tions will approximate kernels based on various distances 
between densities (see Eigure 1). With this technique, we 
can approximate kernels based on total variation, Hellinger, 
and Jensen-Shannon divergences, among others. Since there 
is then no need to compute a Gram matrix, one will be able 
to use these kernels while still scaling to datasets with a 




Eigure 1; We approximate kernels between densities Pi,Pj 

. , , ^ ^ , iid iid 

With random features of sample sets Xi ^ Pi^Xj ^ Pj- 









large number of instances using primal-space techniques. 
We provide an approximation bound for the embeddings, 
and demonstrate the efficacy of the embeddings on both real- 
world and synthetic data. To the best of our knowledge, this 
work provides the first non-discretized embedding for non- 
L 2 kernels for probability density functions. 

Related Work 

The two main lines of relevant research are the development 
of kernels on probability distributions and explicit approxi¬ 
mate embeddings for scalable kernel learning. 

Learning on distributions In computer vision, the pop¬ 
ular “bag of words” model (Leung and Malik 2001) repre¬ 
sents a distribution by quantizing it onto codewords (usually 
by running fc-means on all, or many, of the input points from 
all sets), then compares those histograms with some kernel 
(often exponentiated x^)- 

Another approach estimates a distance between distri¬ 
butions, often the L 2 distance or Kullback-Leibler (kl) 
divergence, parametrically (Jaakkola and Haussler 1998; 
Moreno, Ho, and Vasconcelos 2003; Jebara, Kondor, and 
Howard 2004) or nonparametrically (Sricharan, Wei, and 
Hero 2013; Krishnamurthy et al. 2014). The distance can 
then be used in kernel smoothing (Poczos et al. 2012a; Oliva, 
Poczos, and Schneider 2013) or Mercer kernels (Moreno, 
Ho, and Vasconcelos 2003; Kondor and Jebara 2003; Jebara, 
Kondor, and Howard 2004; Poczos et al. 2012b). 

These approaches can be powerful, but usually require 
computing an x TV matrix of kernel evaluations, which 
can be infeasible for large datasets. Using these distances in 
Mercer kernels faces an additional challenge, which is that 
the estimated Gram matrix may not be PSD, due to estima¬ 
tion error or because some divergences in fact do not induce 
a PSD kernel. In general this must be remedied by altering 
the Gram matrix a “nearby” PSD one. Typical approaches 
involve eigendecomposing the Gram matrix, which usually 
costs 0{N^) computation and also presents challenges for 
traditional inductive learning, where the test points are not 
known at training time (Chen et al. 2009). 

One way to alleviate the scaling problem is the Nystrom 
extension (Williams and Seeger 2001), in which some 
columns of the Gram matrix are used to estimate the re¬ 
mainder. In practice, one frequently must compute many 
columns, and methods to make the result PSD are known 
only for mildly-indefinite kernels (Belongie et al. 2002). 

Another approach is to represent a distribution by its mean 
RKHS embedding under some kernel k. The RKHS inner 
product is known as the mean map kernel (mmk), and the 
distance the maximum mean discrepancy (mmd) (Gretton et 
al. 2009; Muandet et al. 2012; Szabo et al. 2015). When k is 
the common RBF kernel, the MMK estimate is proportional 
to an L 2 inner product between Gaussian kernel density es¬ 
timates. 

Approximate embeddings Recent interest in approxi¬ 
mate kernel embeddings was spurred by the “random 
kitchen sink” (RKS) embedding (Rahimi and Recht 2007), 


which approximates shift-invariant kernels K on by sam¬ 
pling their Fourier transform. 

A related line of work considers additive kernels, of the 
form K{x,y) = usually defined on Mio 

(e.g. histograms). Maji and Berg (2009) construct an embed¬ 
ding for the intersection kernel via step 

functions. Vedaldi and Zisserman (2010) consider any ho¬ 
mogeneous K, SO that K{tx,ty) = tK{x,y), which also al¬ 
lows them to embed histogram kernels such as the additive 
kernel and Jensen-Shannon divergence. Their embedding 
uses the same fundamental result of Fuglede (2005) as ours; 
we expand to the continuous rather than the discrete case. 
Vempati et al. (2010) later apply RKS embeddings to obtain 
generalized RBF kernels (1). 

For embeddings of kernels on input spaces other than 
the RKS embedding extends naturally to locally compact 
abelian groups (Li, lonescu, and Sminchisescu 2010). Oliva 
et al. (2014) embedded an estimate of the L 2 distance be¬ 
tween continuous densities via orthonormal basis functions. 
An embedding for the base kernel k also gives a simple 
embedding for the mean map kernel (Flaxman, Wang, and 
Smola 2015; Jitkrittum et al. 2015; Lopez-Paz et al. 2015; 
Sutherland and Schneider 2015). 

Embedding Information Theoretic Kernels 

For a broad class of distributional distances d, including 
many common and useful information theoretic divergences, 
we consider generalized RBF kernels of the form 

K{p,q) =exp(^-^d'^{p,q)^ . (1) 

We will construct features z{A{-)) such that K{p,q) « 
z{A{p))~^z{A{q)) as follows: 

1. We define a random function t/> such that d{p,q) « 

\\ip{p) — ^( 9 ) 11 , where tp{p) is a function from [0,1]^ to 
]g 2 M jjjg metric space of densities with distance d 

is approximately embedded into the metric space of 2M- 
dimensional L 2 functions. 

2. We use orthonormal basis functions to approximately em¬ 
bed smooth L 2 functions into finite vectors in . Com¬ 
bined with the previous step, we obtain features A{p) G 

such that d is approximated by Euclidean dis¬ 
tances between the A features. 

3. We use the RKS embedding z^) so that inner products 
between z{A{-)) features, in approximate K{p, q). 

We can thus approximate the powerful kernel K without 
needing to compute an expensive N x N Gram matrix. 

Homogeneous Density Distances (HDDs) 

We consider kernels based on metrics which we term homo¬ 
geneous density distances (HDDs): 

d^{p,q)= K{p{x),q{x))dx, (2) 

“'[ 0 . 1 ]'’ 

where k(x, y) : R+ x R+ —)■ R+ is a negative-type kernel, 
i.e. a squared Hilbertian metric, and K{tx,ty) = tK{x,y) 
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for all t > 0. Table 1 shows a few important instances. Note 
that we assume the support of the distributions is contained 
within [ 0 , 1 ]^. 

Name K{p{x),q{x)) d/r(A) 

Ere{p,9} ( p(")+?(^) ^ cosh(^AKl+A^) 

5 (^^/p{x) - \/q{x)j i(5(A = 0)dA 

TV \p{x)-q{x)\ 

Table 1: Squared HDDs. JS is Jensen-Shannon divergence; 
H is Hellinger distance; TV is total variation distance. 


Finite Embeddings of L2 

If densities p and q are smooth, then the L 2 metric between 
the px and qx functions may be well approximated using 
projections to basis functions. Suppose that {ipi}i^z is an 
orthonormal basis for L 2 {[ 0 , 1 ]); then we can construct an 
orthonormal basis for L 2 ([ 0 , 1 ]^) by the tensor product: 

i 

{‘Pajaez^! where (Pa(x) = W'Pai{Xi), X e [0,1]^ 
V/ e L2([0, 1]^), /(x) = ^ a„(/) Pa{x) 


We then use these distances in a generalized RBF kernel 
(1). d is a Hilbertian metric (Fuglede 2005), so K is positive 
definite (Haasdonk and Bahlmann 2004). Note we use the 
VTV metric, even though TV is itself a metric. 


Embedding HDDs into L2 

Fuglede (2005) shows that k corresponds to a bounded mea¬ 
sure /i(A), as in Table 1, with 

Kix,y)= f |x5+‘^-y5+‘^|2d^(A). (3) 

•/e>o 

Let Z = /r(i?>o) and ca = (—5 + iX)/{\ + iA); then 

n{x,y) ='Rxr^iz\9x{x) - 9\{y)? 

where gx{x) = xfZcxixh^^^ — 1 ). 

We can approximate the expectation with an empirical 

mean. Let \j ^ for j G {1,..., M}; then 


n{x,y) 


T7^I5a, (x)-5a, {y)?- 


Hence, using 91 ,3 to denote the real and imaginary parts: 


dHp,9)= f ^ipix),q{x))dx 


= ^\^^\9\iPix)) - 9\i9ix))\‘^dx 

d[o,iV 

~ / {{'^i9Xjip{x)))-3\{gx^{q{x)))y 

+ {'3{g\yp{x)))-3{gxyq{x)))f)dx 
= IIV’(f)-■*/'(?) f, (4) 


where [ip{p)]{x) is given by 

^ (pai ix),---,Px,^{x),p{yx),..., p{^ (x)), 

defining (x) = m(gxyp(x))), p^Jx) = J(gxyp(x))). 
Hence, the HDD between densities p and q is approximately 
the L 2 distance from i/’(p) to t/’(g), where ip maps a function 
/ : [ 0 , ly !->• M to a vector-valued function t/>(/) : [ 0 , 1 ]^ i-)- 
]g 2 M ^ functions. M can typically be quite small, since 
the kernel it approximates is one-dimensional. 


and aa{f) = {(fa, f) = /(O dt G M. Let V C 

be an appropriately chosen finite set of indices. If /, /' G 
L 2 ([ 0 , 1]^) are smooth and a{f) = (oa, (/),..., (/)), 

then 11/—/'||2 « ||a(/)—a(/')|p. Thus we can approximate 
cf as the squared distance between finite vectors: 

d'^{p,q) « Hip) - iPiq)\y 

^ M 

= \\Aip)-A{q)r (5) 

where A : L2([0,1]^) —>■ IR 2 M|v| concatenates the a fea¬ 
tures for each A function. That is, A{p) is given by 

^ {d{p^j,... ,d{p^j,dipij,... ,dipij). (6) 

We will discuss how to estimate d{px), d{px) shortly. 

Embedding RBF Kernels into 

The A features approximate the HDD (2) in thus 

applying the RKS embedding (Rahimi and Recht 2007) to 
the A features will approximate our generalized RBF kernel 
(1). The RKS embedding is' z : K™ —)• such that for 

fixed ■V(0, and for each x, y G M"*: 

zix)'^ziy) K. exp (-5^ ||a; - yf'^ , where 
zix) = yE (sin(u;7a:), cos(a;|/'x),...) . (7) 

Thus we can approximate the HDD kernel (1) as: 

K(p,q) = exp (^-^d^{p,q)^ 

~ exp {^^WMp) - ^(7)f ^ 

« z{Aip))'^z{A{q)). ( 8 ) 

'There are two versions of the embedding in common use, but 
this one is preferred (Sutherland and Schneider 2015). 
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Finite Sample Estimates 

Our final approximation for HDD kernels (8) depends on in¬ 
tegrals of densities p and q. In practice, we are unlikely to 
directly observe an input density, but even given a pdf p, the 
integrals that make up the elements of A{p) are not readily 
computable. We thus first estimate the density as p, e.g. with 
kernel density estimation (kde), and estimate A(p) as A{p). 
Recall that the elements of A{p) are: 

acc{p%)= [ ipa{t)pl^{t)dt (9) 

where j € M}, S' € {R,I},a € V. In lower di¬ 

mensions, we can approximate (9) with simple Monte Carlo 

numerical integration. Choosing Unif([0,1]^): 


estimation, to obtain a uniform error bound with a rate based 
on the function C~^ (Gine and Guillou 2002). We use the 
Fourier basis and choose V = (a € 
for parameters 0 <s< 7 , f>0. 

Then, for any Srks + (£kde + ea + etaii + Eint) < e, 
the probability of the error exceeding e is at most: 


2 exp (—-D£rks) + 2 exp (—Mef / (8Z^)) + 5 

/ 4 2/3/(2/3+r)\ 

+ ) + 2M (1 - m([0,...0)) 


-I- 8M \V\ exp 



( ^l + e^J(S\V\Z)-l 

\ V ^+1 


( 10 ) 

1—1 

obtaining A{p). We note that in high dimensions, one may 
use any high-dimensional density estimation scheme (e.g. 
Lafferty, Liu, and Wasserman 2012) and estimate (9) with 
MCMC techniques (e.g. Hoffman and Gelman 2014). 

Summary and Complexity 

The algorithm for computing features {z{A{pi))}^^ for 
a set of distributions {pi}fLi^ given sample sets {xi}f=i 

where Xi = £ [0, Pi, is thus: 

1. Draw M scalars Xj ^ and D/2 vectors lOr 
Af{0, cr~^l 2 M\v\)^ in 0{M \V\ D) time. 

2. For each of the N input distributions i: 

(a) Compute a kernel density estimate from xt, Pi{uj) for 
each Uj in (10), in 0{nine) time. 

(b) Compute A{pi) using a numerical integration estimate 
as in (10), in 0{M \V\ Ue) time. 

(c) Get the RKS features, z{A{pi)), in 0{M \V\ D) time. 

Supposing each rii x n, this process takes a total of 
O {Nnrie + N M\V\ne + N M\V\D) time. Taking \V\ 
to be asymptotically 0{n), Ue = 0{D), and M = 
0(1) for simplicity, this is 0{NnD) time, compared to 
0{N'^n log n+N^) for the methods of Poczos et al. (2012b) 
and 0{N'^n^) for Muandet et al. (2012). 


Theory 


We bound Pr 


( K{p, q) - z{A{p)yz{A{q)) 


> £ I for two 


fixed densities p and q by considering each source of error: 
kernel density estimation (ekde); approximating /i(A) with 
M samples (ea); truncating the tails of the projection coef¬ 
ficients (Etaii); Monte Carlo integration (Eint); and the RKS 
embedding (erks)- 

We need some smoothness assumptions on p and q\ that 
they are members of a periodic Holder class Sper(/3,L^), 
that they are bounded below by p* and above by p*, and that 


their kernel density estimates are in Sper( 7 , L) with proba¬ 
bility at least 1 — 5. We use a suitable form of kernel density 


where Wtaii = ^max (o, ^e^^ - i) . 

The bound decreases when the function is smoother 
(larger j3, 7; smaller L) or lower-dimensional (€), or when 
we observe more samples (n). Using more projection coeffi¬ 
cients (higher t or smaller s, giving higher |U|) improves the 
approximation but makes numerical integration more dif¬ 
ficult. Likewise, taking more samples from p (higher M) 
improves that approximation, but increases the number of 
functions to be approximated and numerically integrated. 

For the proof and further details, see the appendix. 

Numerical Experiments 

Throughout these experiments we use M = 5, \V\ = 10^ 
(selected as rules of thumb; larger values did not improve 
performance), and use a validation set (10% of the training 
set) to choose bandwidths for KDE and the RBE kernel as 
well as model regularization parameters. Except in the scene 
classification experiments, the histogram methods used 10 
bins per dimension; performance with other values was not 
better. The KL estimator used the fourth nearest neighbor. 

We evaluate RBE kernels based on various distances. First, 
we try our JS, Hellinger, and TV embeddings. We compare to 
L 2 kernels as in Oliva et al. (2014): exp (—l|P ~ ^lli) ~ 
z{a{p))~^z{d{q)) (L2). We also try the MMD distance (Muan¬ 
det et al. 2012) with approximate kernel embeddings: 

exp ^MMD(p, « z (z(p))^ z (z(g)), where z is 

the mean embedding z{p) = ^ 'YAi=i z{Xi) (MMD). We fur¬ 
ther compare to RKS with histogram JS embeddings (Vem- 
pati et al. 2010) (Hist JS); we also tried x^ embeddings, but 
their performance was quite similar. We finally try the full 
Gram matrix approach of Poczos et al. (2012b) with the KL 
estimator of Wang, Kulkami, and Verdti (2009) in an RBE 
kernel (KL), as did Ntampaka et al. (2015). 

Gram Matrix Estimation 

We first illustrate that our embedding, using the parameter 
selections as above, can approximate the Jensen-Shanon ker¬ 
nel well. We compare three different approaches to estimat¬ 
ing K{pi,pj) = exp(—^ 3S{pi,pj)). Each approach uses 
kernel density estimates pi. The estimates are compared on 
a dataset of = 50 random GMM distributions {pi}fc=i 


4 



samples of size n = 2 500: Xi = ^ [0, Pi- 

See the appendix for more details. 

The first approach approximates JS based on empirical es¬ 
timates of entropies E logpi. The second approach estimates 
JS as the Euclidean distance of vectors of projection coeffi¬ 
cients (5): JSpc(pi,Pj) = ||4(pi) —A{pj)\\‘^. For these first 
two approaches we compute the pairwise kernel evaluations 
in the Gram matix as G®"* = exp(—^ JSent(Pi,Pj)), and 
= exp(-^ JSpc(pi,Pj)) using their respective ap¬ 
proximations for JS. Lastly, we directly estimate the JS ker¬ 
nel with dot products of our random features (8): = 


z{A{pi)yz{A{pj)), with D = 7 000. 

Figure 2 shows the 
true pairwise kernel 
values versus the afore¬ 
mentioned estimates. 
Quantitatively, the en¬ 
tropy method obtained 
a squared correlation 
to the true kernel value 
of = 0.981; using 
the A features with an 
exact kernel yielded 
= 0.974; adding 
RKS embeddings gave 


Estimating JS Gram Matrix 



Figure 2: Estimating RBF with 
JS divergence. 


= 0.966. Thus our method’s estimates are nearly as 
good as direct estimation via entropies, while allowing us to 
work in primal space and avoid N x N Gram matrices. 


Estimating the Number of Mixture Components 

We will now illustrate the efficacy of HDD random fea¬ 
tures in a regression task, following Oliva et al. (2014): esti¬ 
mate the number of components from a mixture of truncated 
Gaussians. We generate the distributions as follows: Draw 
the number of components Yi for the ith distribution as ~ 
Unifjl,..., 10}. For each component select a mean ~ 
Unif[—5,5]^ and covariance -f 

where a ^ Unif[l, 4], A^^\u, v) ^ Unif[—1,1], and is 

a diagonal 2x2 matrix with {u, u) ~ Unif [0,1]. Then 
weight each component equally in the mixture. Given a sam¬ 
ple Xi, we predict the number of components Yi. An exam¬ 
ple distribution and sample are shown in Figure 3; predicting 
the number of components is difficult even for humans. 



Sample with 9 Components 

1r 



0.2 


°0 0.2 0.4 0.6 0.8 1 


Figure 3: A GMM and 200 points drawn from it. 

Figure 4 presents results for predicting with ridge regres¬ 
sion the number of mixture components Yi, given a varying 
number of sample sets Xi, with \xi\ € {200, 800}; we use 


D = 5 000. The HDD-based kernels achieve substantially 
lower error than the L 2 and MMD kernels. They also outper¬ 
form the histogram kernel, especially with \xi\ — 200, and 
the KL kernel. Note that fitting mixtures with EM and se¬ 
lecting a number of components using AlC (Akiake 1973) or 
BIC (Schwarz 1978) performed much worse than regression; 
only AlC with |xi| = 800 outperformed a constant predictor 
of 5.5. Linear versions of the L 2 and MMD kernels were also 
no better than the constant predictor. 

The HDD embeddings were more computationally expen¬ 
sive than the other embeddings, but much less expensive 
than the KL kernel, which grows at least quadratically in 
the number of distributions. Note that the histogram em¬ 
beddings used an optimized C implementation (Vedaldi and 
Fulkerson 2008), as did the KL kernel^, while the HDD em¬ 
beddings used a simple Matlab implementation. 

Image Classification 

As another example of the performance of our embeddings, 
we now attempt to classify images based on their distribu¬ 
tions of pixel values. We took the “cat” and “dog” classes 
from the ClFAR-10 dataset (Krizhevsky and Hinton 2009), 
and represented each 32 x 32 image by a set of triples 
{x, y, v), where x and y are the position of each pixel in the 
image and v the pixel value after converting to grayscale. 
The horizontal reflection of the image was also included, so 
each sample set Xi C had \xi\ = 2 048. This is certainly 
not the best representation for these images; rather, we wish 
to show that given this simple representation, our HDD ker¬ 
nels perform well relative to the other options. 

We used the same kernels as above in an SVM classi¬ 
fier from LIBLINEAR (Fan et al. 2008, for the embeddings) 
or LIB SVM (Chang and Lin 2011, for the KL kernel), with 
D = 7 000. Figure 5 shows computation time and accuracy 
on the standard test set (of size 2 k) with 2.5 k, 5k, and IOk 
training images. Our JS and Hellinger embedding approxi¬ 
mately match the histogram JS embedding in accuracy here, 
while our TV embedding beats histogram JS; all outperform 
L 2 and MMD. We could only run the KL kernel for the 2.5 k 
training set size; its accuracy was comparable to the HDD 
and histogram embeddings, at far higher computational cost. 

Scene Classification 

Modem computer vision classification systems typically 
consist of a deep network with several convolutional and 
pooling layers to extract complex features of input images, 
followed by one or two fully-connected classification layers. 
The activations are of shape nxhx w, where n is the num¬ 
ber of filters; each unit corresponds to an overlapping patch 
of the original image. We can thus treat the final pooled acti¬ 
vations as a sample of size hw from an n-dimensional distri¬ 
bution, similarly to how Poczos et al. (2012b) and Muandet 
et al. (2012) used SIFT features from image patches. Wu, 
Gao, and Liu (2015) set accuracy records on several scene 
classification datasets with a particular ad-hoc method of ex¬ 
tracting features from distributions (D3); we compare to our 
more principled alternatives. 
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(a) Samples of size 200. 
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(b) Samples of size 800. 


Figure 4: Error and computation time for estimating the number of mixture components. The three points on each line corre¬ 
spond to training set sizes of 4 k, 8k, and 16 k; error is on the hxed test set of size 2 k. Note the logarithmic scale on the time 
axis. The KL kernel for \xi\ = 800 with 16 k training sets was too slow to run. AlC-based predictions achieved RMSEs of 2.7 
(for 200 samples) and 2.3 (for 800); BIC errors were 3.8 and 2.7; a constant predictor of 5.5 had RMSE of 2.8. 



Figure 5: Misclassification rate and computation time for clas¬ 
sifying ClFAR-10 cats versus dogs. The three points on each 
line show training set sizes of 2.5k, 5k, and 10k; error is on 
the fixed test set of size 2k. Note the linear scale for time. The 
KL kernel was too slow to run for 5k or 10k training points. 


We consider the Scene-15 dataset (Lazebnik, Schmid, and 
Ponce 2006), which contains 4485 natural images in 15 
location categories, and follow Wu, Gao, and Liu in ex¬ 
tracting features from the last convolutional layer of the 
imagenet-vgg-verydeep-l6 model (Simonyan and 
Zisserman 2015). We replace that layer’s rectihed linear ac¬ 
tivations with sigmoid squashing to [0,1].^ hw ranges from 
400 to 1 000. There are 512 filter dimensions; we concate¬ 
nate features A(j)i) extracted from each independently. 

We train on the standard for this dataset of 100 images 
from each class (1500 total) and test on the remainder; Fig¬ 
ure 6 shows results. We do not add any spatial information 
to the model; still, we match the best prior published perfor¬ 
mance of 91.59 ± 0.48, which trained on over 7 million ex¬ 
ternal images (Zhou et al. 2014). Adding spatial information 

^We used piecewise-linear weights before the sigmoid function 
such that 0 maps to 0.5, the 90th percentile of the positive observa¬ 
tions maps to 0.9, and the 10th percentile of the negative observa¬ 
tions to 0.1, for each filter. 



Figure 6; Mean and standard deviation of accuracies on the 
Scene-15 dataset in 10 random splits. The left, black lines use 
A(-) features; the right, blue lines show z{A{-)) features. MMD 
methods vary bandwidth are relative to tr, the median of pair¬ 
wise distances; histogram methods vary the number of bins. 


brought the D3 method slightly above 92% accuracy; their 
best hybrid method obtained 92.9%. Using these features, 
however, our methods match or beat MMD and substantially 
outperform D3, L 2 , and the histogram embeddings. 

Discussion 

This work presents the first nonlinear embedding of density 
functions for quickly computing HDD-based kernels, includ¬ 
ing kernels based on the popular total variation, Hellinger 
and Jensen-Shanon divergences. While such divergences 
have shown good empirical results in the comparison of den¬ 
sities, nonparametric uses of kernels with these divergences 
previously necessitated the computation of a large N x N 
Gram matrix, prohibiting their use in large datasets. Our em¬ 
beddings allow one to work in a primal space while using in¬ 
formation theoretic kernels. We analyze the approximation 
error of our embeddings, and illustrate their quality on sev¬ 
eral synthetic and real-world datasets. 
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Appendices 

Gram Matrix Estimation 

We illustrate our embedding’s ability to approximate the 
Jensen-Shanon divergence. In the examples below the densi¬ 
ties considered are mixtures of five equally weighted truncated 
spherical Gaussians on [0,1]^. That is, 

1 ® 

= 5 y^A't(mjj-,diag(4j)) 


where ~ Unif([0,1]^), Sij ~ Unif([0.05,0.15]^) and 
Mt{rn, s) is the distribution of a Gaussian truncated on [0,1]^ 
with mean parameter m and covariance matrix parameter s. 

We work over the sample set {xi}iLiy where Xi = ^ 

[0,1]^}”=! Pi, n = 2500, N = 50. 

We compare three different approaches to estimating 
K(j)i,pj) = exp(— 2 ^ JS(pi,pj)). Each approach uses den¬ 
sity estimates pi, which are computed using kernel density esti¬ 
mation. The first approach is based on estimating JS using em¬ 
pirical estimates of entropies: 


= — 


log 


[»/2l 

m—1 

\n/2-] 

+ 5 log 

m—1 

\n/2-] 

+ 5 £ log 

m—1 

~ JSent {pi ! Pj )) 


Piix) 

log 


l^pj 


log 


Pj{x) 


Pi[x)+Pj{x) 

\n/2\ 


k{x, 


7)^ 


- 2 5] log 


m—1 


Pj{X: 


U), 


pdX^^)+Pj{X^^) 


p^ix^^)+PJ{x^J:^) 


where density estimates pi above are based on points 
{Xm}'^^\^n/ 2 '\+i lo avoid biasing the empirical means. The 
second approach estimates JS as the Euclidean distance of vec¬ 
tors of projection coefficients: 

JS(p*,Pj)« wMpi) - Mpo)t = JSpcbi.Pj), 

where here the density estimates pi are based on the entire 
set of points Xi- We build a Gram matrix for each of these 
approaches by setting G™* = exp(—^ JSent(Pi,Pj)) and 
= exp(— 2 ^ JSpc(pi,Pj)). Lastly, we directly estimate 
the JS kernel with random features: 


Gij’' = z{A{pi))^ z{A{pj)). 

We compare the effectiveness of each approach by comput¬ 
ing the score of the estimates produced versus a true JS 


Estimating JS Gram Matrix 



Figure 7: Estimating RBF values using the JS diveregence. 


kernel value computed through numerically integrating the true 
densities (see Figure 7 and Table 2). The RBF values estimated 
with our random features produce estimates that are nearly as 
good as directly estimating JS divergences through entropies, 
whilst allowing us to work over a primal space and thus avoid 
computing a N x N Gram matrix for learning tasks. 

Table 2: values of estimates of JS Gram elements. 


Method 

i?2 

Entropies 

0.9812 

PCs 

0.9735 

RKS 

0.9662 


Proofs 

We will now prove the bound on the error probability of our em¬ 
bedding Pr q) — z(A(p))^ 2 ;(A(g))| > ej for fixed den¬ 

sities p and q. 

Setup We will need a few assumptions on the densities: 

1. p and q are bounded above and below: for x £ [0,1]^, 0 < 
P* < p{x), q{x) < p* < oo. 

2. p,q £ T,{P,Lp) for some P,L^ > 0. E(/3,L) refers 
to the Holder class of functions / whose partial deriva¬ 
tives up to order [_P\ are continuous and whose rth partial 
derivatives, where r is a multi-index of order [/3J, satisfy 
\D^ f{x) — O'" f{y)\ < L\\x — y\\^. Here 'iP\ is the greatest 
integer strictly less than p. 

3. p, q are periodic. 

These are fairly standard smoothness assumptions in the non- 
parametric estimation literature. 

Let 7 = min(/3,1). If /? > 1, then p,q £ E(l, L^) for some 
L-y; otherwise, clearly p,q £ E(/3, Lp). Then, from assumption 
3, p,q £ Eper( 7 , 1 / 7 ), the periodic Holder class. We’ll need this 
to establish the Sobolev ellipsoid containing p and q. 
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We will use kernel density estimation with a bounded, con¬ 
tinuous kernel so that the bound of Gine and Guillou (2002) 
applies, with bandwidth h x logn, and truncating 

density estimates to [p*, p*\. 

We also use the Fourier basis (pa = exp ^ 2 i 7 ra^a:^, and de¬ 
fine V as the set of indices a s.t. < t for parame¬ 

ters 0 < s < 1 , f > 0 to be discussed later. 


Tail truncation error The third term of (11), the error 
due to truncating the tail projection coefficients of the pf 
functions, requires a little more machinery. First note that 

|||v>(p) - - \\A{p) - I is at most 

M 2 

^ -' 7 a)| • ( 12 ) 

j = l S=R,I a(f_V 


Decomposition Let rcr(A) = exp (—A^/(2o-^)). Then 

\K{p,q) - z{A{p))'^ z{A{q))^ < 

\K{p,q)-ra„ (||i(p) -i((j)||)| 

+ \r<Tk (P(P) - 4(<j)||) - z{A{p))'^ z{A{q))\,. 

The latter term was bounded by Rahimi and Recht (2007). For 
the former, note that is -Lipschitz, so the first term is 

at most— \d{p,q)— ||A(p) — yl(g) || [. Breaking this up with 
the triangle inequality: 

\d{p,q) - \\A{p)-A{q)\\ \ < 

\d{p,q) - d{p,q)\ -|- \d{p,q) - \\ip{p) - i>{q)\\\ 

+ mp)-m\\-\\A{p)-A{q)\\\ 

+ \\\A{p)-Am-\\Mp)-Mm\- (11) 

Estimation error Recall that d is a metric, so the reverse 
triangle inequality allows us to address the first term with 

\d{p,q) - d{p,q)\ < d{p,p) +d{q,q). 


Let W{s,L) be the Sobolev ellipsoid of functions 
such that Eaez^ = 

where is still the Fourier basis. Then Lemma 14 of Krish- 
namurthy et al. (2014) shows that Eper( 7 ,L 7 ) C W{s,L') for 
any 0 < s < 7 and L' = fL^(27r)“^L^J 4 =^ 4 ^- 

So, suppose that p,q € Eper( 7 , L) with probability at least 
1 — (5. Since x >->■ * 2 +*^ is ^ 2 ^+^^ -Lipschitz on [p*, 00 ), pf £ 

Sper ^ 7 , 2 a /1 + 4A2 L and so pf - qf is in Wjs, (1 -I- 

4A^)L') for s < 7 and L' = £L^pT^/(l — 

Recall that we chose V to be the set of a G such 
that Ei=ilajf^ < Thus Ea^t/I““(PA “ ^a)!^ < 

~ dx)\‘^ /^ < (1 + 4:X^)L'jt. 

The tail error term is therefore at least e with probabil¬ 
ity no more than <5-1-2 E^i Pr (^(1 + 4A|)L'/1 > £^/( 2 AT)^ . 
The latter probability, of course, depends on the choice of 
HDD d. Letting ^ = te‘^/{8ML') — it is 1 if ^ < 0 and 
1—p ([0, v^]) /Z otherwise. If C > 0, squared Hellinger’s prob¬ 
ability is 0, and total variation’s is ^ arctan(v^). A closed form 
for the cumulative distribution function for the Jensen-Shannon 
measure is unfortunately unknown. 


For d^ the total variation, squared Hellinger, or Jensen-Shannon 
HDDs, we have that d^{p,q) < TV(p,p) (Lin 1991). More¬ 
over, as the distributions are supported on [0,1]^, TVjp,)?) = 

\ IIp-pIIi < 2 lb-p|loo- 

It is a consequence of Gine and Guillou (2002) that, for 
any 5 > 0, Pr (^||p - 35 ||^ > ) < d for some Cs 

depending on the kernel. Thus Pr(|ci(p, < 7 ) — d{p,q)\ > e) < 

„„_1 /£'1„2/3/(2P+C) \ 

( 4iogn j. where =a:. 


A approximation The second term of (11), the approxi¬ 
mation error due to sampling As, admits a simple Hoeffd- 
I l |2 II l |2 

ing bound. Note that p;^ “7!? + Pa “ 'Ia ’ viewed as 

a random variable in A only, has expectation d^ {p, q) and is 
bounded by [0,4Z] (where Z = dp(A)): write it as 

Z J|p(a:) 2 +*-’' — (jjx) 2 +’^|^ dx, expand the square, and use 
f ^yp(x)q(x)dx < 1 (via Cauchy-Schwarz). 

For nonnegative random variables X and Y, 
Pr(|X-y|>e) < Pr(|x 2 -y 2 | >^ 2 ^^ 

have that Pr (| ||V’( 35 ) —'i/'(( 7 )|| — d(p, <j)| > e) is at most 
2exp(-Me^/(8Z2)). 


Numerical integration error The final term of (11) also 
bears a Hoeffding bound. Define the projection coefficient dif¬ 
ference Af ^(p, q) = aa\{pf) — aaiqf), and A similarly but 

with a. Then |||A(p) — A(<j)||^ — ||A(j5) — A(( 7 )||^| is utmost 


M 


E E E 

j = l S=R,I aeV 



Af ,Aj (p, q) 


(13) 


Letting e(p) = aa{p\) — aa{pf), each summand is at 
most (e(p) -f i{q)f + 2 | Af_^( 33 , g)j (e(p) -f i{q)). Also, 

a(p>'?)| ^ 2\/Z, using Cauchy-Schwarz on the integral 
and ||<pa ||2 = 1- Thus each summand in (13) can be more than 
e only if one of the es is more than \JZ e/4 — -sAZ. 


Now, using (10), aa{pf) is an empirical mean 
of Ue independent terms, each with absolute value 
bounded by (yb* + 1 ) maxa;|<pa(a;)| = Vp* + 1 . 
Thus, using a Hoeffding bound on the es, we get that 
Pr (|||A(p) — A{q)\A — ||A(p) — A(( 7 )|b| > e) is no more than 
(^z+sy{8S)-/z) 


8M S exp — 


2Zi^+lY 
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Final bound Combining the bounds for the decomposition 
(11) with the pointwise rate for RKS features, we get: 


Pr 


\k{p, q) - z{A{p)'^z{A{q))^ - - 


+ 2 exp (-Met/isz"^)^ 


+ (5 + 2M 1 — ^ 0, W max 0 


P*t4ii 47 - 4« 1 


+ 8M \V\exp —^ne 


47 

^1+4/(8|C4)-1 



for any erks + (ekde + £a + £taii + £int) ^ £■ 
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